\name{Bruvo.distance}
\alias{Bruvo.distance}
\title{Genetic Distance Metric of Bruvo et al.}
\description{
  This function calculates the distance between two
    individuals at one microsatellite locus using the method of Bruvo et
    al. (2004).
}
\usage{Bruvo.distance(genotype1, genotype2, maxl=9, usatnt=2, missing=-9)}
\arguments{
  \item{genotype1}{A vector of alleles for one individual at one
      locus.  Allele length is in nucleotides.  Each unique allele
      corresponds to one element in the vector, and the vector is no
      longer than it needs to be to contain all unique alleles for this
      individual at this locus.}
  \item{genotype2}{A vector of alleles for another individual at the
      same locus.}
  \item{maxl}{If both individuals have more than this number of
      alleles at this locus, NA is returned instead of a numerical
      distance.}
  \item{usatnt}{Length of the repeat at this locus.  For example
      usatnt=2 for dinucleotide repeats, and usatnt=3 for trinucleotide
      repeats.  If the alleles in genotype1 and genotype2 are expressed
      in repeat count instead of nucleotides, set usatnt=1.}
  \item{missing}{A numerical value that, when in the first allele
      position, indicates missing data. NA is returned if this value is
      found in either genotype, or if either genotype has a length of zero.}
}
\details{
  Since allele copy number is frequently unknown in polyploid
  microsatellite data, Bruvo et al. developed a measure of genetic
  distance similar to band-sharing indices used with dominant data,
  but taking into account mutational distances between alleles.  A
  matrix is created containing all differences in repeat count between
  the alleles of two individuals at one locus.  These differences are
  then geometrically transformed to reflect the probabilities of
  mutation from one allele to another.  The matrix is then searched to
  find the minimum sum if each allele from one individual is paired to
  one allele from the other individual.  This sum is divided by the
  number of alleles per individual.

  If one genotype has more alleles than the other, 'virtual alleles' must
  be created so that both genotypes are the same length.  There are
  three options for the value of these virtual alleles, but
  Bruvo.distance only implements the simplest one, assuming that it is
  not known whether differences in ploidy arose from genome addition or
  genome loss.  Virtual alleles are set to infinity, such that the
  geometric distance between any allele and a virtual allele is 1.
}
\value{
  A number ranging from 0 to 1, with 0 indicating identical
    genotypes, and 1 being a theoretical maximum distance if all alleles
    from genotype1 differed by an infinite number of repeats from all
    alleles in genotype2.  NA is returned if both genotypes have
    more than maxl alleles or
    if either genotype has the symbol for missing data as its first allele.
}
\references{
  Bruvo, R., Michiels, N. K., D'Souza, T. G., and Schulenburg, H. (2004)
  A simple method for the calculation of microsatellite genotype distances
  irrespective of ploidy level. \emph{Molecular Ecology} \bold{13}, 2101-2106.
}
\note{
  The processing time is a function of the factorial of the number of
  alleles, since each possible combination of allele pairs must be
  evaluated.  For genotypes with a sufficiently large number of alleles,
  it may be more efficient to estimate distances manually by creating
  the matrix in Excel and visually picking out the shortest distances
  between alleles.  This is the purpose of the maxl argument.  On my
  personal computer, if both genotypes had more than nine alleles, the
  calculation could take an hour or more, and so this is the default
  limit.  In this case, Bruvo.distance returns NA.
}
\examples{
  Bruvo.distance(c(202,206,210,220),c(204,206,216,222))
  Bruvo.distance(c(202,206,210,220),c(204,206,216,222),usatnt=4)
  Bruvo.distance(c(202,206,210,220),c(204,206,222))
  Bruvo.distance(c(202,206,210,220),c(204,206,216,222),maxl=3)
  Bruvo.distance(c(202,206,210,220),c(-9))
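
  # The steps in Details can be sketched in base R as follows.  This is an
  # illustrative sketch, not the code used inside Bruvo.distance.  The helper
  # names all.perms and bruvo.sketch are hypothetical, and the transformation
  # 1 - 2^(-|x|) for a difference of x repeats is taken from Bruvo et al. (2004).
  # Missing data and the maxl limit are not handled here.
  all.perms <- function(v) {
    # list every ordering of the elements of v
    if (length(v) <= 1) return(list(v))
    out <- list()
    for (i in seq_along(v)) {
      for (p in all.perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
    }
    out
  }
  bruvo.sketch <- function(g1, g2, usatnt = 2) {
    # pad the shorter genotype with virtual alleles set to infinity
    n <- max(length(g1), length(g2))
    g1 <- c(g1, rep(Inf, n - length(g1)))
    g2 <- c(g2, rep(Inf, n - length(g2)))
    # matrix of differences in repeat count between all pairs of alleles
    repeat.diff <- abs(outer(g1, g2, "-")) / usatnt
    # geometric transformation; an infinite difference gives exactly 1
    geo <- 1 - 2^(-repeat.diff)
    # try every one-to-one pairing of alleles and keep the smallest sum
    sums <- sapply(all.perms(seq_len(n)),
                   function(p) sum(geo[cbind(seq_len(n), p)]))
    min(sums) / n
  }
  bruvo.sketch(c(202,206,210,220), c(204,206,216,222))
  bruvo.sketch(c(202,206,210,220), c(204,206,222))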
}
\author{Lindsay V. Clark}
\keyword{arith}

\name{read.GeneMapper}
\alias{read.GeneMapper}
\title{Read GeneMapper Genotypes Tables}
\description{
  Given a vector of filepaths to tab-delimited text files containing
  genotype data in the ABI GeneMapper Genotypes Table format, read.GeneMapper produces
  a two-dimensional list of genotypes that can be read by other
  functions in the polysat package.
}
\usage{
  read.GeneMapper(infiles, missing=-9)
}
\arguments{
  \item{infiles}{A character vector of paths to the files to be read.}
  \item{missing}{A numerical value used to indicate missing data for a
    given sample and locus.}
}
\value{
  The object produced is a two-dimensional list of vectors representing
  the genotypes.  Samples are
  represented by the first dimension and loci by the second dimension.
  The names of the samples and loci are taken from the Sample Names and
  Markers columns, respectively, of the GeneMapper files.  The vectors
  at each position of the list are numerical and are only as long as
  needed to contain each allele (for that sample and locus) as one element.
}
\details{
  read.GeneMapper can read the genotypes tables that are exported
  by the Applied Biosystems GeneMapper software.  The user may have to
  make the following alterations to the files: 1) delete
  any rows with missing data, or fill in a numerical missing
  data symbol of the user's choice (such as -9) in the first allele slot
  for that row; 2) make sure that all allele names are numeric
  representations of fragment length (no question marks or dashes); and
  3) put sample names into the Sample Name column, if the names to be
  used in analysis are not already there.  Each file should have the
  standard header row produced by the software.  If any sample has more
  than one genotype listed for a given locus, only the last genotype
  listed will be used.

  The file format is simple enough that the user can easily create files
  manually if GeneMapper is not the software used in allele calling.
  The files are tab-delimited text files.  There should be a header row
  with column names.  The column labeled \dQuote{Sample Name} should contain
  the names of the samples, and the column labeled \dQuote{Marker} should
  contain the names of the loci.  You can have as many or as few columns as
  needed to contain the alleles, and each of these columns should be
  labeled \dQuote{Allele X} where X is a number unique to each column.  Row
  labels and any other columns are ignored.  For any given sample, each
  allele is listed only once and is given as an integer that is the
  length of the fragment in nucleotides.  Alleles are separated by
  tabs.  If you have more allele columns than alleles for any given
  sample, leave the extra cells blank so that read.table will read them
  as NA.  Example data files in this format are included in the package.

  read.GeneMapper will read all of your data at once.  It takes as its
  first argument a character vector containing paths to all of the files
  to be read.  How the data are distributed over these files does not
  matter.  The function finds all unique sample names and all unique
  markers across all the files, and automatically puts a missing data
  symbol into the list if a particular sample and locus combination is
  not found.  Rows in which all allele cells are blank should NOT be
  included in the input files; either delete these rows or put the
  missing data symbol into the first allele cell.

  Sample and locus names must be consistent within and across the
  files.  The list that is produced is indexed by these names.  For
  example, if the object produced was called mygenotypes,
  \code{mygenotypes[["AB1","ABC5"]]} would be a vector containing alleles for
  sample AB1 at locus ABC5.  \code{mygenotypes[,"ABC5"]} would be a list of all
  genotypes at locus ABC5, and \code{mygenotypes["AB1",]} would be a list of
  all genotypes for sample AB1.
}
\references{
  \url{http://www.appliedbiosystems.com/genemapper}
}
\examples{
  \dontrun{
    myinfiles<-c("data/sample CBA15.txt",
                 "data/sample CBA23.txt",
                 "data/sample CBA28.txt")
    mygenotypes<-read.GeneMapper(myinfiles)

    #Look at the object produced.  Alleles are not listed but you can
    #see that the array was filled.
    mygenotypes

    #Look at the genotype of individual FCR5.
    mygenotypes["FCR5",]

    #Correct one of the genotypes
    mygenotypes[["FCR5","RhCBA15"]]<-c(208)
  }
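
  # A sketch of the file layout described in Details, written to a temporary
  # file and read back with read.table.  The allele sizes and the sample FCR7
  # are made up for illustration, and read.GeneMapper itself is not called.
  tmp <- tempfile(fileext = ".txt")
  writeLines(c("Sample Name\tMarker\tAllele 1\tAllele 2\tAllele 3",
               "FCR5\tRhCBA15\t202\t206\t210",
               "FCR5\tRhCBA23\t151\t157\t",
               "FCR7\tRhCBA15\t-9\t\t"), tmp)
  rawtable <- read.table(tmp, sep = "\t", header = TRUE)
  # blank allele cells are read as NA, and the last row carries the
  # missing data symbol in its first allele cell
  rawtable
  unlink(tmp)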
}
\author{Lindsay V. Clark}
\keyword{file}

\name{distance.matrix.1locus}
\alias{distance.matrix.1locus}
\title{Pairwise Genetic Distances at One Locus}
\description{
  Given all genotypes for one locus, create a pairwise genetic distance matrix.
}
\usage{
distance.matrix.1locus(gendata, distmetric=Bruvo.distance,
progress=TRUE, ...)
}
\arguments{
  \item{gendata}{A list of vectors, where each vector contains all the
    alleles in the genotype of one sample at this locus.  names(gendata)
    should be the sample names corresponding to the genotypes.}
  \item{distmetric}{This is the function that will be used to calculate
    each pairwise distance.  This should be a function that, given two
    vectors of alleles, returns a numerical distance.}
  \item{progress}{If TRUE, distance.matrix.1locus will print the names
    of sample pairs as it finishes each calculation with distmetric.
    For large datasets, this is intended so that the user can monitor
    the progress of the calculations.}
  \item{...}{These arguments will be passed to distmetric.  For example,
    with Bruvo.distance, maxl, usatnt, or missing may be used.}
}
\value{
  A symmetrical matrix of distances, with the names of samples used as
  row and column names.
}
\details{
  Given a list of genotypes at one locus, \sQuote{distance.matrix.1locus}
  produces a symmetrical matrix of pairwise distances between
  genotypes.  If using a polysat genotype object such as that produced
  by \sQuote{read.GeneMapper}, the \sQuote{gendata} argument should be one of the
  sublists, for example \code{mygenotypedata[,"locus1"]}.  The measure of distance
  can be any that is provided with polysat, or any function written by
  the user, so long as it takes genotypes as vectors of alleles (or any
  other type of object that is given as elements of the list \sQuote{gendata})
  and returns a numerical distance.  Any arguments that need to be
  passed to the distmetric function can be given to
  \sQuote{distance.matrix.1locus}.  To save processing time, each pairwise
  distance is only calculated once and then written to both locations in
  the matrix simultaneously.  The user also has the option to have each
  pair of sample names printed after the distance is calculated, so that
  progress can be monitored if evaluation is expected to take a long time.
}
\seealso{
  \code{\link{Bruvo.distance}}, \code{\link{read.GeneMapper}}
}
\examples{
mygenotypes<-list(IND1=c(124,127,133),IND2=c(130,139,145,151),IND3=c(118,127,133,154))
distance.matrix.1locus(mygenotypes,usatnt=3)
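
# A sketch of the approach described in Details, using a simple user-written
# metric in place of Bruvo.distance.  The names shared.fraction and
# pairwise.sketch are hypothetical and are not part of polysat, and the
# genotypes are made up for illustration.
shared.fraction <- function(genotype1, genotype2) {
  # one minus the fraction of distinct alleles found in both genotypes
  1 - length(intersect(genotype1, genotype2)) /
    length(union(genotype1, genotype2))
}
pairwise.sketch <- function(gendata, distmetric, ...) {
  n <- length(gendata)
  d <- matrix(0, n, n, dimnames = list(names(gendata), names(gendata)))
  for (i in seq_len(n)) {
    for (j in i:n) {
      # calculate each distance once and write it to both positions
      d[i, j] <- d[j, i] <- distmetric(gendata[[i]], gendata[[j]], ...)
    }
  }
  d
}
sketchgenotypes <- list(IND1=c(124,127,133), IND2=c(124,133), IND3=c(118))
pairwise.sketch(sketchgenotypes, shared.fraction)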
}
\author{Lindsay V. Clark}
\keyword{array}

\name{mean.distance.matrix}
\alias{mean.distance.matrix}
\title{Mean Pairwise Distance Matrix}
\description{
  Given a two-dimensional list of genotypes, indexed by sample and
  locus, mean.distance.matrix produces a
  symmetrical matrix of pairwise distances between samples, averaged
  across all loci.  An array of all distances prior to averaging may
  also be produced.
}
\usage{
mean.distance.matrix(gendata, samples=dimnames(gendata)[[1]],
loci=dimnames(gendata)[[2]], all.distances=FALSE, usatnts=NULL, ...)
}
\arguments{
  \item{gendata}{A two-dimensional list of genotypes, such as that produced by
    read.GeneMapper.  The first dimension is an index of samples, the
    second dimension is an index of loci, and the elements are numerical
    vectors containing the alleles as elements.}
  \item{samples}{A character vector of samples to be analyzed.  These
    should be all or a subset of the sample names used in gendata.}
  \item{loci}{A character vector of loci to be analyzed.  These should
    be all or a subset of the loci names used in gendata.}
  \item{all.distances}{If FALSE, only the mean distance matrix will be
    returned.  If TRUE, a list will be returned containing an array of
    all distances by locus and sample as well as the mean distance matrix.}
  \item{usatnts}{A numerical vector that contains the length of
    nucleotide repeats for each
    locus.  For example, 3 would be used to indicate a locus with
    trinucleotide repeats.  1 should be used if alleles are written in
    terms of repeat number, not fragment length in nucleotides.
    names(usatnts) should be the names of the loci, as used in
    dimnames(gendata)[[2]].  This argument can be omitted if repeat
    length is irrelevant to the distance metric.}
  \item{...}{If distmetric or progress are given here they will be
    passed to distance.matrix.1locus.  Any other arguments will be
    passed to distmetric.}
}
\value{
  A symmetrical matrix containing pairwise distances between all
  samples, averaged across all loci.  Row and column names of the matrix
  will be the sample names provided in the \sQuote{samples} argument.  If
  all.distances=TRUE, a list will be produced containing the above
  matrix as well as a three-dimensional array containing all distances
  by locus and sample.  The array is the first item in the list, and the
  mean matrix is the second.
}
\details{
  \sQuote{mean.distance.matrix} uses \sQuote{distance.matrix.1locus} once for each locus
  to be analyzed, then averages values across these matrices.  Any
  arguments that need to be passed to \sQuote{distance.matrix.1locus} may be
  given to \sQuote{mean.distance.matrix}.  If the loci are of different repeat
  types and the type of repeat is important for the distance metric
  being used (e.g. \sQuote{Bruvo.distance}), the usatnts argument can be used to
  pass a different \sQuote{usatnt} argument to \sQuote{distmetric}
  depending on the locus.

  Because the user may want to omit samples or loci, the samples and
  loci arguments are given for convenient indexing of the data to be
  analyzed.  If gendata contains only the data that the user wants to
  analyze, the user can simply omit these arguments.

  Missing data must be represented by the missing data symbol, rather
  than NA.
}
\seealso{
  \code{\link{distance.matrix.1locus}}, \code{\link{read.GeneMapper}}
}
\examples{
# create a list of genotype data
mygendata <-
  array(list(c(124,128,138),c(122,130,140,142),c(122,132,136),c(122,134,140),
             c(203,212,218),c(197,206,221),c(215),c(200,218),
             c(140,144,148,150),c(-9),c(146,150),c(152,154,158),
             c(233,236,280),c(-9),c(-9),c(-9))
   ,dim=c(4,4),dimnames=list(c("ind1","ind2","ind3","ind4"),
  c("locus1","locus2","locus3","locus4")))

# make index vectors of data to use
myloci <- c("locus1","locus2","locus3")
mysamples <- c("ind1","ind2","ind4")

# locus1 and locus3 have dinucleotide repeats, and locus2 has
# trinucleotide repeats
myusatnts <- c(2,3,2)
names(myusatnts) <- myloci

mean.distance.matrix(mygendata, mysamples, myloci, all.distances=TRUE, usatnts=myusatnts)
}
\author{Lindsay V. Clark}
\keyword{array}

\name{write.Structure}
\alias{write.Structure}
\title{Write Genotypes in Structure 2.3 Format}
\description{
  Given genotypes in the form of a two-dimensional list of vectors,
  write.Structure produces a text file of the genotypes in a format
  readable by Structure 2.2 and higher.  The user specifies the overall
  ploidy of the file as well as the ploidy of each sample.
}
\usage{
write.Structure(gendata, ploidy, outfile,
samples=dimnames(gendata)[[1]], loci=dimnames(gendata)[[2]],
indploidies=rep(ploidy,times=length(samples)),
extracols=NULL, missing=-9)
}
\arguments{
  \item{gendata}{Genotypes stored as a two-dimensional list of vectors, such as
    produced by read.GeneMapper.}
  \item{ploidy}{PLOIDY for Structure, i.e. how many rows per individual
    to write.}
  \item{outfile}{A character string of the file path where the file
    should be written to.}
  \item{samples}{An optional character vector listing the names of samples to be
    written to the file.}
  \item{loci}{An optional character vector listing the names of the loci to be
    written to the file.}
  \item{indploidies}{A numerical vector containing the ploidy of each
    sample. names(indploidies) should correspond to \sQuote{samples}.}
  \item{extracols}{An array, with the first dimension names
    corresponding to \sQuote{samples}, of PopData, PopFlag, LocData, Phenotype,
    or other values to be included in the extra columns in the file.}
  \item{missing}{The number used to indicate missing data.}
}
\value{
  No value is returned, but instead a file is written at the path specified.
}
\details{
  Structure 2.2 and higher can process polyploid microsatellite data,
  although 2.3.3 or higher is recommended for its improvements on
  polyploid handling.  The input format of Structure requires that
  each locus take up one column and that each individual take up as
  many rows as the parameter PLOIDY.  Because of the multiple rows per
  sample, each sample name must be duplicated, as well as any
  population, location, or phenotype data.  Partially heterozygous
  genotypes also must have one arbitrary allele duplicated up to the
  ploidy of the sample, and samples that have a lower ploidy than that
  used in the file (for mixed polyploid data sets) must have a missing
  data symbol inserted up to fill in the extra rows.  Additionally, if
  some samples have more alleles than PLOIDY (if you are using a lower
  PLOIDY to save processing time, or if there are extra alleles from
  scoring errors), some alleles must be randomly removed from the data.
  \sQuote{write.Structure} performs this duplication, insertion, and random
  deletion of data.

  The argument samples contains all of the sample names to be written to
  the file, and is used to index \sQuote{gendata}, \sQuote{indploidies},
  and \sQuote{extracols}.
  These sample names will also be used as row names in the Structure
  file.  Each sample name should appear in the vector samples only once,
  because \sQuote{write.Structure} will duplicate the sample names a number of
  times as dictated by \sQuote{ploidy}.  Likewise, \sQuote{indploidies}
  and \sQuote{extracols}
  only need to contain data for each sample once.  If samples isn't
  specified by the user it will be extracted from \sQuote{gendata}.

  In writing genotypes to the file, \sQuote{write.Structure} compares the number
  of alleles in the genotype, the ploidy of the sample as stored in
  \sQuote{indploidies}, and the ploidy of the file as stored in \sQuote{ploidy}, and does
  one of six things (for a given sample x and locus loc):

  \enumerate{
    \item If indploidies[x] is greater than or equal to ploidy, and
    length(gendata[[x,loc]]) is equal to ploidy, the genotype data is
    used as is.
    \item If indploidies[x] is greater than or equal to ploidy, and
    length(gendata[[x,loc]]) is less than ploidy, the first allele is
    duplicated as many times as necessary for there to be as many alleles
    as ploidy.
    \item If indploidies[x] is greater than or equal to ploidy, and
    length(gendata[[x,loc]]) is greater than ploidy, a random sample of
    the alleles, without replacement, is used as the genotype.
    \item If indploidies[x] is less than ploidy, and
    length(gendata[[x,loc]]) is equal to indploidies[x], the genotype
    data is used as is and missing data symbols are inserted in the extra
    rows.
    \item If indploidies[x] is less than ploidy, and
    length(gendata[[x,loc]]) is less than indploidies[x], the first
    allele is duplicated as many times as necessary for there to be as
    many alleles as indploidies[x], and missing data symbols are inserted
    in the extra rows.
    \item If indploidies[x] is less than ploidy, and
    length(gendata[[x,loc]]) is greater than indploidies[x], a random
    sample, without replacement, of indploidies[x] alleles is used, and
    missing data symbols are inserted in the extra rows. (Alleles are
    removed even though there is room for them in the file.)
  }

  Two of the header rows that are optional for Structure are written by
  \sQuote{write.Structure}.  These are \sQuote{Marker Names}, containing the names of loci
  supplied in gendata, and \sQuote{Recessive Alleles}, which contains the missing
  data symbol once for each locus.  This indicates to the program that
  all alleles are codominant with copy number ambiguity.

  The output file requires a few small modifications, done in a text
  editor or spreadsheet software, in order to be read
  by Structure.  In the upper left corner the words \dQuote{rowlabel} and
  \dQuote{missing} should be deleted.  Likewise the first and second rows for
  any non-locus columns should be deleted if the extracols argument was
  used.  These should include the second dimension names used in
  \sQuote{extracols}, and zeros, respectively.
}
\references{
  \url{http://pritch.bsd.uchicago.edu/structure_software/release_versions/v2.3.3/structure_doc.pdf}

  Hubisz, M. J., Falush, D., Stephens, M., and Pritchard, J. K. (2009)
  Inferring weak population structure with the assistance of sample
  group information.  \emph{Molecular Ecology Resources} \bold{9}, 1322-1332.

  Falush, D., Stephens, M., and Pritchard, J. K. (2007)
  Inference of population structure using multilocus genotype data:
  dominant markers and null alleles.  \emph{Molecular Ecology Notes} \bold{7}, 574-578.
}
\seealso{
  \code{\link{read.GeneMapper}}
}
\examples{
# input genotype data (this is usually done by reading a file)
mygendata<-array(list(c(100,102,106,108,114,118),c(102,110),
c(98,100,104,108,110,112,116),c(102,106,112,118),c(104,108,110),c(-9),
c(204),c(206,208,210,212,220,224,226),
c(202,206,208,212,214,218),c(200,204,206,208,212),c(-9),c(202,206)),
dim=c(6,2),dimnames=list(c("ind1","ind2","ind3","ind4","ind5","ind6"),c("locus1","locus2")))

# create a vector of sample names to be used.  Note that this excludes
#  ind6.
# Also note that this could be obtained as dimnames(mygendata)[[1]][1:5].
mysamples<-c("ind1","ind2","ind3","ind4","ind5")

# create a vector of the ploidy of each sample.
# Note that some of the above genotypes have more or fewer alleles than
# the ploidy of the sample.
myploidies<-c(6,6,6,4,4)
names(myploidies)<-mysamples

# Create an array containing data for additional columns to be written
# to the file.  You might also prefer to just read this and the ploidies
# in from a file.
myexcols<-array(data=c(1,2,1,2,1,1,1,0,0,0),dim=c(5,2),
dimnames=list(mysamples, c("PopData","PopFlag")))

# Write the Structure file, with six rows per individual.
# Since outfile="", the data will be written to the console instead of a file.
write.Structure(mygendata, 6, "", samples=mysamples, indploidies=myploidies,
 extracols=myexcols)
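
# The six cases listed in Details can be sketched as one helper.  This is an
# illustration only: fill.genotype is a hypothetical name, not a polysat
# function, and write.Structure may be implemented differently.
fill.genotype <- function(alleles, indploidy, ploidy, missing = -9) {
  # number of real alleles to write: the smaller of the two ploidies
  target <- min(indploidy, ploidy)
  if (length(alleles) < target) {
    # cases 2 and 5: duplicate the first allele
    alleles <- c(alleles, rep(alleles[1], target - length(alleles)))
  } else if (length(alleles) > target) {
    # cases 3 and 6: random sample of alleles without replacement
    alleles <- alleles[sample.int(length(alleles), target)]
  }
  # cases 4 to 6: fill the extra rows with the missing data symbol
  c(alleles, rep(missing, ploidy - target))
}
fill.genotype(c(102,110), indploidy = 6, ploidy = 6)
fill.genotype(c(102,106,112,118), indploidy = 4, ploidy = 6)
fill.genotype(c(200,204,206,208,212), indploidy = 4, ploidy = 6)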
}
\author{Lindsay V. Clark}
\keyword{file}

\name{estimate.ploidy}
\alias{estimate.ploidy}
\title{
Maximum and Mean Allele Count for Estimation of Ploidy
}
\description{
Given genotypes in the form of a two-dimensional list of vectors,
estimate.ploidy produces a two-dimensional array containing the maximum
number of alleles and the mean number of alleles for each sample, across
all loci.
}
\usage{
estimate.ploidy(gendata, samples = dimnames(gendata)[[1]], loci = dimnames(gendata)[[2]])
}
\arguments{
  \item{gendata}{
Genotypes stored as a two-dimensional list of vectors, in the format
produced by read.GeneMapper.
}
  \item{samples}{
    An optional character vector of samples to evaluate, which is a subset
    of dimnames(gendata)[[1]].
}
\item{loci}{
  An optional character vector of loci to use in the calculation, which
  is a subset of dimnames(gendata)[[2]].
}
}
\details{
  To assist the user in determining the ploidy of each sample,
  \sQuote{estimate.ploidy} looks at the genotype of the sample across
  all loci and returns the maximum number of alleles per locus.  The
  mean number of alleles is also returned to assist with checking for
  errors (for example, if an octoploid genotype was accidentally scored
  at one locus for a diploid sample).  Both of these are calculated
  using the \sQuote{length} function on the genotype vectors.

  The user may want to extract the vector containing the maximum number
  of alleles (for example, myploidies<-ploidyinfo[,1]) and then manually
  edit the values based on other knowledge of the organism.  This vector
  can then be used as the \sQuote{indploidies} argument for \sQuote{write.Structure}.
}
\value{
  An array with the second dimension of length 2 and the first dimension
  as long as \sQuote{samples}.  The rows are labeled by sample name and
  the columns are labeled \sQuote{max.alleles} and \sQuote{mean.alleles}.
}
\author{
Lindsay V. Clark
}

\seealso{
  \code{\link{read.GeneMapper}}, \code{\link{write.Structure}}
}
\examples{
# Create a data set to analyze
mygendata <-
  array(list(c(124,128,138),c(122,130,140,142),c(122,132,136),c(122,134,140),
             c(203,212,218),c(197,206,221),c(215),c(200,218),
             c(140,144,148,150),c(-9),c(146,150),c(152,154,158))
   ,dim=c(4,3),dimnames=list(c("ind1","ind2","ind3","ind4"),
  c("locus1","locus2","locus3")))

# Run the function
estimate.ploidy(mygendata)
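
# For illustration, the two statistics described in Details can be computed
# by hand with the length function.  This is a sketch on a small made-up
# data set without missing data, not the code used inside estimate.ploidy.
sketchdata <- array(list(c(124,128,138), c(122,130),
                         c(203,212), c(197,206,221,230)),
                    dim=c(2,2),
                    dimnames=list(c("ind1","ind2"), c("locus1","locus2")))
# number of alleles for every sample and locus
allelecounts <- array(sapply(sketchdata, length), dim=dim(sketchdata),
                      dimnames=dimnames(sketchdata))
# maximum and mean allele count for each sample, across loci
ploidysketch <- cbind(max.alleles=apply(allelecounts, 1, max),
                      mean.alleles=apply(allelecounts, 1, mean))
ploidysketch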

}
\keyword{arith}
\keyword{array}
